# SPEC-12: OpenTelemetry Observability

## Why

We need comprehensive observability for basic-memory-cloud to:

- Track request flows across our multi-tenant architecture (MCP → Cloud → API services)
- Debug performance issues and errors in production
- Understand user behavior and system usage patterns
- Correlate issues to specific tenants for targeted debugging
- Monitor service health and latency across the distributed system

Currently, we have only basic logging, without request correlation or distributed tracing capabilities.

## What

Implement OpenTelemetry instrumentation across all basic-memory-cloud services with:

### Core Requirements

1. **Distributed Tracing**: End-to-end request tracing from the MCP gateway through to tenant API instances
2. **Tenant Correlation**: All traces tagged with tenant_id, user_id, and workos_user_id
3. **Service Identification**: Clear service naming and namespace separation
4. **Auto-instrumentation**: Automatic tracing for FastAPI, SQLAlchemy, and HTTP clients
5. **Grafana Cloud Integration**: Direct OTLP export to Grafana Cloud Tempo

### Services to Instrument

- **MCP Gateway** (basic-memory-mcp): Entry point with JWT extraction
- **Cloud Service** (basic-memory-cloud): Provisioning and management operations
- **API Service** (basic-memory-api): Tenant-specific instances
- **Worker Processes** (ARQ workers): Background job processing

### Key Trace Attributes

- `tenant.id`: UUID from UserProfile.tenant_id
- `user.id`: WorkOS user identifier
- `user.email`: User email for debugging
- `service.name`: Specific service identifier
- `service.namespace`: Environment (development/production)
- `operation.type`: Business operation (provision/update/delete)
- `tenant.app_name`: Fly.io app name for tenant instances

## How

### Phase 1: Set Up the OpenTelemetry SDK
1. Add OpenTelemetry dependencies to each service's `pyproject.toml`:

   ```python
   "opentelemetry-distro[otlp]>=1.29.0",
   "opentelemetry-instrumentation-fastapi>=0.50b0",
   "opentelemetry-instrumentation-httpx>=0.50b0",
   "opentelemetry-instrumentation-sqlalchemy>=0.50b0",
   "opentelemetry-instrumentation-logging>=0.50b0",
   ```

2. Create a shared telemetry initialization module (`apps/shared/telemetry.py`)
3. Configure the Grafana Cloud OTLP endpoint via environment variables:

   ```bash
   OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp-gateway-prod-us-east-2.grafana.net/otlp
   OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic[token]
   OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
   ```

### Phase 2: Instrument MCP Gateway

1. Extract tenant context from the AuthKit JWT in middleware
2. Create a root span with tenant attributes
3. Propagate trace context to downstream services via headers

### Phase 3: Instrument Cloud Service

1. Continue the trace from the MCP gateway
2. Add operation-specific attributes (provisioning events)
3. Instrument ARQ worker jobs for async operations
4. Track Fly.io API calls and latency

### Phase 4: Instrument API Service

1. Extract tenant context from the JWT
2. Add machine-specific metadata (instance ID, region)
3. Instrument database operations with SQLAlchemy
4. Track MCP protocol operations

### Phase 5: Configure and Deploy

1. Add OTLP configuration to `.env.example` and `.env.example.secrets`
2. Set Fly.io secrets for production deployment
3. Update Dockerfiles to use the `opentelemetry-instrument` wrapper
4. Deploy to the development environment first for testing

## How to Evaluate

### Success Criteria

1. **End-to-end traces visible in Grafana Cloud** showing complete request flow
2. **Tenant filtering works** - Can filter traces by tenant_id to see all requests for a user
3. **Service maps accurate** - Grafana shows correct service dependencies
4. **Performance overhead < 5%** - Minimal latency impact from instrumentation
5. **Error correlation** - Can trace errors back to a specific tenant and operation

### Testing Checklist

- [x] Single request creates a connected trace across all services
- [x] Tenant attributes present on all spans
- [x] Background jobs (ARQ) appear in traces
- [x] Database queries show in trace timeline
- [x] HTTP calls to Fly.io API tracked
- [x] Traces exported successfully to Grafana Cloud
- [x] Can search traces by tenant_id in Grafana
- [x] Service dependency graph shows correct flow

### Monitoring Success

- All services reporting traces to Grafana Cloud
- No OTLP export errors in logs
- Trace sampling working correctly (if implemented)
- Resource usage acceptable (CPU/memory)

## Dependencies

- Grafana Cloud account with OTLP endpoint configured
- OpenTelemetry Python SDK v1.29.0+
- FastAPI instrumentation compatibility
- Network access from Fly.io to Grafana Cloud

## Implementation Assignment

**Recommended Agent**: python-developer

- Requires Python/FastAPI expertise
- Needs understanding of distributed systems
- Must implement middleware and context propagation
- Should understand the OpenTelemetry SDK and instrumentation

## Follow-up Tasks

### Enhanced Log Correlation

While basic trace-to-log correlation works automatically via OpenTelemetry logging instrumentation, consider adding structured logging for improved log filtering:

1. **Structured Logging Context**: Add `logger.bind()` calls to inject tenant/user context directly into log records
2. **Custom Loguru Formatter**: Extract OpenTelemetry span attributes for better log readability
3. **Direct Log Filtering**: Enable searching logs directly by tenant_id or workflow_id without going through traces

This would complement the existing automatic trace correlation and provide better log search capabilities.
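The tenant-correlation step that runs through Phases 2 and 4 (and the structured-logging context above) boils down to decoding claims from the JWT and attaching them as span attributes. A minimal stdlib sketch of that decoding step, assuming the token carries `tenant_id`, `sub`, and `email` claims (the actual AuthKit claim names may differ, and signature verification is omitted here):

```python
import base64
import json


def span_attributes_from_jwt(token: str) -> dict:
    """Decode the JWT payload (no signature check) into trace attributes."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)   # restore stripped base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    # Map token claims onto the trace attribute names defined in this spec.
    return {
        "tenant.id": claims.get("tenant_id", ""),
        "user.id": claims.get("sub", ""),
        "user.email": claims.get("email", ""),
    }
```

In middleware, the returned dict would be passed to `span.set_attributes(...)` on the root span before the request is forwarded downstream.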
## Alternative Solution: Logfire

After implementing OpenTelemetry with Grafana Cloud, we discovered limitations in the observability experience:

- Traces work but lack useful context without correlated logs
- Setting up log correlation with Grafana is complex and requires additional infrastructure
- The developer experience for Python observability is suboptimal

### Logfire Evaluation

**Pydantic Logfire** offers a compelling alternative that addresses our specific requirements:

#### Core Requirements Match

- ✅ **User Activity Tracking**: Automatic request tracing with business context
- ✅ **Error Monitoring**: Built-in exception tracking with full context
- ✅ **Performance Metrics**: Automatic latency and performance monitoring
- ✅ **Request Tracing**: Native distributed tracing across services
- ✅ **Log Correlation**: Seamless trace-to-log correlation without setup

#### Key Advantages

1. **Python-First Design**: Built specifically for Python/FastAPI applications by the Pydantic team
2. **Simple Integration**: `pip install logfire` + `logfire.configure()` vs a complex OTLP setup
3. **Automatic Correlation**: Logs automatically include trace context without manual configuration
4. **Real-time SQL Interface**: Query spans and logs using SQL with auto-completion
5. **Better Developer UX**: Purpose-built observability UI vs generic Grafana dashboards
6. **Loguru Integration**: `logger.configure(handlers=[logfire.loguru_handler()])` maintains existing logging

#### Pricing Assessment

- **Free Tier**: 10M spans/month (suitable for development and small production workloads)
- **Transparent Pricing**: $1 per million spans/metrics after the free tier
- **No Hidden Costs**: No per-host fees, only usage-based metering
- **Production Ready**: Recently exited beta, enterprise features available

#### Migration Path

The existing OpenTelemetry instrumentation is compatible - Logfire uses OpenTelemetry under the hood, so the current spans and attributes would work unchanged.
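A sketch of what that migration might look like, using the calls named in the evaluation (`logfire.configure()`, `logfire.loguru_handler()`) plus Logfire's FastAPI instrumentation hook; the service name here is illustrative, and the token is read from the `LOGFIRE_TOKEN` environment variable:

```python
import logfire
from fastapi import FastAPI
from loguru import logger

app = FastAPI()

# One call replaces the OTLP exporter/env-var setup from Phase 1.
logfire.configure(service_name="basic-memory-cloud")
logfire.instrument_fastapi(app)

# Keep the existing loguru logging; records now carry trace context.
logger.configure(handlers=[logfire.loguru_handler()])
```

Because Logfire speaks OpenTelemetry, the tenant attributes set by the existing middleware would flow through unchanged.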
### Recommendation

**Consider migrating to Logfire** for the following reasons:

1. It directly addresses the "next to useless" traces problem by providing integrated logs
2. Dramatically simpler setup and maintenance compared to Grafana Cloud + custom log correlation
3. Better ROI on the observability investment with purpose-built Python tooling
4. Free tier sufficient for current development needs, with a clear scaling path

The current Grafana Cloud implementation provides a solid foundation and could remain as a backup/export target, while Logfire becomes the primary observability platform.

## Status

**Created**: 2024-01-28
**Status**: Completed (OpenTelemetry + Grafana Cloud)
**Next Phase**: Evaluate Logfire migration
**Priority**: High - Critical for production observability
